Fundamental Techniques in Data Science with R
Leftover slides from last week
Data manipulation
Basic analysis (correlation & t-test)
Pipes
Deviations and modeling
Use comments (#) to clarify what you are doing
R-scripts
RStudio projects

Functions have parentheses (). Names directly followed by parentheses always indicate functions. For example:

matrix() is a function
c() is a function
(1 - 2) * 5 is a calculation, not a function

Packages give additional functionality to R.
By default, some packages are included. These packages allow you to do mainstream statistical analyses and data manipulation. Installing additional packages allows you to perform state-of-the-art statistical programming and estimation.
The cool thing is that these packages are all developed by users. New developments therefore reach you very quickly:
A list of available packages can be found on CRAN
There are two ways to load a package in R
library(stats)
and
require(stats)
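The practical difference between the two: library() stops with an error when the package is not installed, whereas require() returns FALSE with a warning, which makes it usable inside a check. A small sketch (the package name noSuchPkg123 is deliberately nonexistent):

```r
# library(noSuchPkg123) would stop with an error;
# require() instead returns FALSE with a warning
ok  <- require(stats, quietly = TRUE)                           # TRUE: stats ships with R
bad <- suppressWarnings(require(noSuchPkg123, quietly = TRUE))  # FALSE, but no error
c(ok, bad)
```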
The easiest way to install e.g. package mice is to use
install.packages("mice")
Alternatively, you can also do it in RStudio through
Tools --> Install Packages
R in depth

A workspace contains all changes you made to R.
A saved workspace contains everything as it was at the moment it was saved.
You do not need to run all the previous code again if you would like to continue working at a later time.
Workspaces are compressed and take up relatively little disk space when stored. The compression is very efficient and beats reloading large datasets.
R by default saves (part of) the code history and RStudio expands this functionality greatly.
It is often useful to look back at the code history, for various reasons.
There are multiple ways to access the code history.
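Saving and restoring a workspace can also be done manually with base R; a minimal sketch (the file path is just an example):

```r
# Save the workspace to a file, wipe an object, then restore it
x  <- c(1, 2, 3)
ws <- file.path(tempdir(), "mywork.RData")  # example file path
save.image(file = ws)  # stores every object currently in the workspace
rm(x)                  # x is now gone
load(ws)               # brings x (and everything else) back
mean(x)
```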
RStudio

Use common sense and BE CONSISTENT.
Browse through the tidyverse style guide
If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers and collaborators out of their rhythm when they go to read it. Try to avoid this.
Intentional spacing makes your code easier to interpret
a<-c(1,2,3,4,5)
vs.
a <- c(1, 2, 3, 4, 5)
At the very least, put a space after every comma!
library(MASS)     # for the cats data
library(dplyr)    # data manipulation
library(haven)    # in/exporting data
library(magrittr) # pipes
library(mice)     # for the nhanes data
transform(): changing and adding columns
dplyr::filter(): row-wise selection (of cases)
table(): frequency tables
class(): object class
levels(): levels of a factor
order(): data entries in increasing order
haven::read_sav(): import SPSS data
cor(): bivariate correlation
sample(): drawing a sample
t.test(): t-test

head(cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
str(cats)
## 'data.frame': 144 obs. of 3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
fem.cats <- cats[cats$Sex == "F", ]
dim(fem.cats)
## [1] 47 3
head(fem.cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
heavy.cats <- cats[cats$Bwt > 3, ]
dim(heavy.cats)
## [1] 36 3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3
heavy.cats <- subset(cats, Bwt > 3)
dim(heavy.cats)
## [1] 36 3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3
dplyr

filter(cats, Bwt > 2, Bwt < 2.2, Sex == "F")
##   Sex Bwt Hwt
## 1   F 2.1 7.2
## 2   F 2.1 7.3
## 3   F 2.1 7.6
## 4   F 2.1 8.1
## 5   F 2.1 8.2
## 6   F 2.1 8.3
## 7   F 2.1 8.5
## 8   F 2.1 8.7
## 9   F 2.1 9.8
class(cats$Sex)
## [1] "factor"
levels(cats$Sex)
## [1] "F" "M"
levels(cats$Sex) <- c("Female", "Male")
table(cats$Sex)
## 
## Female   Male 
##     47     97
head(cats)
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.1 7.2
## 5 Female 2.1 7.3
## 6 Female 2.1 7.6
sorted.cats <- cats[order(cats$Bwt), ]
head(sorted.cats)
##       Sex Bwt Hwt
## 1  Female 2.0 7.0
## 2  Female 2.0 7.4
## 3  Female 2.0 9.5
## 48   Male 2.0 6.5
## 49   Male 2.0 6.5
## 4  Female 2.1 7.2
cats.numbers <- cbind(Weight = cats$Bwt, HeartWeight = cats$Hwt)
head(cats.numbers)
##      Weight HeartWeight
## [1,]    2.0         7.0
## [2,]    2.0         7.4
## [3,]    2.0         9.5
## [4,]    2.1         7.2
## [5,]    2.1         7.3
## [6,]    2.1         7.6
rbind(cats[1:3, ], cats[1:5, ])
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.0 7.0
## 5 Female 2.0 7.4
## 6 Female 2.0 9.5
## 7 Female 2.1 7.2
## 8 Female 2.1 7.3
cor(cats[, -1])
##           Bwt       Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000
With [, -1] we exclude the first column (Sex), so that cor() receives only numeric columns.
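Negative indices drop rather than select. A quick illustration on a toy data frame (the values below are made up, not the real cats data):

```r
# A toy data frame shaped like cats (values are illustrative)
d <- data.frame(Sex = c("F", "M"), Bwt = c(2.0, 3.0), Hwt = c(7.0, 11.0))

d[, -1]        # drop the first column, keeping Bwt and Hwt
d[, -c(1, 3)]  # drop the first and third columns
d[-1, ]        # the same logic works on rows: drop the first row
```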
cor.test(cats$Bwt, cats$Hwt)
## 
##  Pearson's product-moment correlation
## 
## data:  cats$Bwt and cats$Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274
What do we conclude?
plot(cats$Bwt, cats$Hwt)
Test the null hypothesis that the difference in mean heart weight between male and female cats is 0
t.test(formula = Hwt ~ Sex, data = cats)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680
plot(formula = Hwt ~ Sex, data = cats)
boys <-
read_sav("boys.sav") %>%
head()
It effectively replaces head(read_sav("boys.sav")).
Let’s assume that we want to load data, change a variable, filter cases and select columns. Without a pipe, this would look like
boys <- read_sav("boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))
With the pipe:
boys <-
read_sav("boys.sav") %>%
transform(hgt = hgt/100) %>%
filter(age > 15) %>%
subset(select = c(hgt, wgt, bmi))
Benefit: a single object in memory that is easy to interpret
Your code becomes more readable:
f(x) becomes x %>% f()

rnorm(10) %>% mean()
## [1] -0.2342249
f(x, y) becomes x %>% f(y)

boys %>% cor(use = "pairwise.complete.obs")
##           hgt       wgt       bmi
## hgt 1.0000000 0.6100784 0.1758781
## wgt 0.6100784 1.0000000 0.8841304
## bmi 0.1758781 0.8841304 1.0000000
h(g(f(x))) becomes x %>% f %>% g %>% h

boys %>% subset(select = wgt) %>% na.omit() %>% max()
## [1] 117.4
nrow(cats)
## [1] 144
cats.outl <-
cats %>%
filter(Hwt < mean(Hwt) + 3 * sd(Hwt),
Hwt > mean(Hwt) - 3 * sd(Hwt))
nrow(cats.outl)
## [1] 143
cats %>% filter(Hwt > mean(Hwt) + 3 * sd(Hwt))
##    Sex Bwt  Hwt
## 1 Male 3.9 20.5
%>% pipe
%$% pipe
%T>% pipe
. in a pipe

In a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.
cats %>% plot(Hwt ~ Bwt, data = .)
VS
cats %$% plot(Hwt ~ Bwt)
The . can be used as a placeholder in the pipe.
cats %$% t.test(Hwt ~ Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680
is the same as
t.test(Hwt ~ Sex, data = cats)
cats.test <- cats %$% t.test(Bwt ~ Sex)
cats.test
## 
##  Welch Two Sample t-test
## 
## data:  Bwt by Sex
## t = -8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6631268 -0.4177242
## sample estimates:
## mean in group Female   mean in group Male 
##             2.359574             2.900000
\[\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{\mathrm{E}[(X - \mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}.\]
\[t = \frac{\bar{X}-\mu}{\hat{\sigma}/\sqrt{n}}.\]
\[\sigma^2_X = \mathrm{E}[(X - \mu)^2].\]
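All three quantities can be computed directly from deviations. A small base-R sketch (the data vectors are made up for illustration):

```r
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)

dev.x <- x - mean(x)           # deviations of X from its mean
dev.y <- y - mean(y)           # deviations of Y from its mean

var.x  <- mean(dev.x^2)        # variance as the mean squared deviation
cov.xy <- mean(dev.x * dev.y)  # covariance from the deviation products
rho    <- cov.xy / sqrt(var.x * mean(dev.y^2))  # correlation coefficient

all.equal(rho, cor(x, y))      # matches R's built-in cor()
```

Note that cor() uses n - 1 in both numerator and denominator, so the factor cancels and the results agree.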
Can you identify which part of these equations are the deviations?
Deviations tell us the distance from each value (observation) to a comparison value.
The arithmetic mean is a very informative measure:
\[\text{The mean itself is a model: observations are merely a deviation from that model.}\]
Deviations summarize the fit of all the points in the data to a single point
The mean is the mathematical expectation. It represents the observed values best for a normally distributed univariate set.
plotdata %>%
mutate("Mean" = X - mean(X),
"Mean + 3" = X - (mean(X) + 3)) %>%
select("Mean", "Mean + 3") %>%
colSums %>%
round(digits = 3)
##     Mean Mean + 3 
##        0     -300
The mean minimizes the deviations
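More precisely, the mean minimizes the sum of squared deviations. We can verify this numerically; the data vector and the alternative comparison values below are arbitrary:

```r
x <- c(1, 3, 5, 7, 14)

# Sum of squared deviations around a candidate comparison value m
ssd <- function(m) sum((x - m)^2)

ssd(mean(x))      # 100: deviations around the mean
ssd(median(x))    # 105: any other value does worse
ssd(mean(x) + 1)  # 105
```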
Throughout statistics we make extensive use of squaring.
\[\text{WHAT ARE THE USEFUL PROPERTIES OF SQUARING THAT STATISTICIANS ARE SO FOND OF?}\]
In this week's practical exercises we will calculate and explore squared deviations, and learn how to minimize them.
Next week we’ll jump to regression. We’ll see how deviations still play a role in that framework: